The bulk of my workload this week went to developing a visualization method for looking at tweets over time and working towards implementing sentiment analysis directly into Elasticsearch. For the visualization, I have committed an R function, plot_tweet_sentiment_timeseries.R, to /COVID-Twitter/analysis on GitHub from my branch hacl-campoh; it takes a dataframe of tweets and returns a ggplot object visualizing sentiment over time (shown later in this notebook). I am currently working with Abraham and Rachel on the backend to attach sentiment information to the tweets in Elasticsearch and greatly shorten the time we must spend analyzing sentiment-related data.
Our plan is to first use vaderSentiment, the Python implementation of VADER, to assign a sentiment score to each tweet in the database. We then plan to use Elasticsearch's aggregation methods to quickly retrieve tweet sentiment data over time.
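Once the compound scores are stored on each document, a single aggregation request could return weekly sentiment without pulling the tweets themselves. Below is a minimal sketch of such a request body; the "sentiment" field name is an assumption, since the actual mapping is still being worked out on the backend:

```python
# Hypothetical Elasticsearch request body for average sentiment per week.
# Assumptions: tweets carry a "created_at" date field and a numeric
# "sentiment" field holding the VADER compound score.
request_body = {
    "size": 0,  # return only aggregation buckets, not the tweets themselves
    "aggs": {
        "tweets_over_time": {
            "date_histogram": {"field": "created_at", "calendar_interval": "week"},
            "aggs": {
                "avg_sentiment": {"avg": {"field": "sentiment"}}
            },
        }
    },
}
print(request_body["size"])  # -> 0
```

This body would be POSTed to the index's _search endpoint; each returned bucket then carries a tweet count and an average sentiment for one week, replacing a client-side pass over every tweet.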
We also plan to upgrade from VADER to a pretrained BERT model in the future, with a change to the last layer so that it yields continuous sentiment scores on the same scale as VADER.
hacl-campoh/COVID-Twitter/analysis: R function for visualizing tweet sentiment trends over time
One of my personal contributions has been the visualization function plot_tweet_sentiment_timeseries.R. It lets the user visualize the tweet count/trendiness of a given tweet dataframe along with its sentiment over time, with tweets grouped either by day or by CDC epidemiological week. It also includes a measure of the divisiveness of sentiment based on Sarle's bimodality coefficient, given by \(divisiveness(X) = \mathrm{logit}((\mathrm{Skew}[X]^2 + 1) / \mathrm{Kurt}[X]) + \log(4/5)\). A score of 0 suggests neither division nor consensus of sentiment, but rather a uniform distribution; a score \(<0\) indicates consensus around a certain sentiment level, as in a truncated Laplace distribution; and a score \(>0\) indicates bimodality of the sentiment distribution and thus a degree of division. The function can also plot moving averages of these statistics.
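As a sanity check on the definition, here is a minimal Python sketch of the divisiveness score (the notebook computes it in R inside plot_tweet_sentiment_timeseries.R; the function below takes precomputed moment estimates rather than raw sentiments):

```python
import math

def divisiveness(skew, kurt):
    """Divisiveness of a sentiment distribution, from Sarle's bimodality
    coefficient b = (skew^2 + 1) / kurt, where kurt is the ordinary
    (non-excess) kurtosis. The log(4/5) shift makes a uniform
    distribution score exactly 0."""
    b = (skew ** 2 + 1.0) / kurt                          # Sarle's bimodality coefficient
    return math.log(b / (1.0 - b)) + math.log(4.0 / 5.0)  # logit(b) + log(4/5)

# A uniform distribution has skewness 0 and kurtosis 9/5, so b = 5/9 and
# the score is logit(5/9) + log(4/5) = log(5/4) - log(5/4) = 0.
print(divisiveness(0.0, 9 / 5))    # -> 0.0 (up to floating point)
# A normal distribution (skewness 0, kurtosis 3) scores below 0: consensus.
print(divisiveness(0.0, 3.0) < 0)  # -> True
```

Note that \(b\) approaches 1 for extreme two-point distributions, where the logit diverges to \(+\infty\); in practice the sentiment distributions stay well inside that limit.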
Another idea of mine was to add a non-trainable layer to a pretrained sentiment classification model to obtain a continuous output. When I raised the issue that VADER, being a lexicon-based NLP technique, could run into tweets without any known words and then return errors, Abraham suggested using a pretrained BERT model.
Since BERT uses a general subword tokenization scheme, this would eliminate the problem. However, BERT models pretrained on tweets perform sentiment classification, i.e. POSITIVE, NEUTRAL, or NEGATIVE, as opposed to returning a continuous sentiment score.
Letting \(\mathbf{y}\) be the softmax-activated output layer, I suggested taking an approach similar to the last step of the VADER algorithm: instead of returning the class \(c = \underset{i}{\operatorname{argmax}} \mathbf{y}_i\), we return the score \(s = \mathbf{y} \cdot [-1, 0, 1]\). This is equivalent to returning the expected value of the class random variable, with \(-1\) being NEGATIVE, \(0\) being NEUTRAL, and \(1\) being POSITIVE.
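A quick illustration of this last step (in Python, since the BERT tooling under discussion is Python-based; the softmax probabilities below are made up):

```python
def continuous_sentiment(y):
    """Collapse a softmax output y = (p_negative, p_neutral, p_positive)
    into the expected value of the class random variable, with
    NEGATIVE = -1, NEUTRAL = 0, POSITIVE = 1; the result lies in [-1, 1]."""
    class_values = (-1.0, 0.0, 1.0)
    return sum(p * v for p, v in zip(y, class_values))

# argmax alone would report POSITIVE for both of these outputs, but the
# expected-value score distinguishes confident from borderline predictions.
print(round(continuous_sentiment((0.05, 0.10, 0.85)), 2))  # -> 0.8
print(round(continuous_sentiment((0.25, 0.20, 0.55)), 2))  # -> 0.3
```

The dot product costs nothing extra at inference time, which is why it can live in a frozen, non-trainable final layer.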
When writing plot_tweet_sentiment_timeseries.R, I was interested in what kinds of sentiment trends the different clusters displayed over time. Therefore, to demonstrate its behavior we apply it to a sample from coronavirus-data-masks and its clusters as determined by k-means. We take 10,000 random tweets from the coronavirus-data-masks dataset with dates ranging from January 1st, 2020, to August 1st, 2020.
###############################################################################
# Get the tweets from Elasticsearch using the search parameters defined above
###############################################################################
results <- do_search(indexname="coronavirus-data-masks",
                     rangestart=rangestart,
                     rangeend=rangeend,
                     text_filter=text_filter,
                     semantic_phrase=semantic_phrase,
                     must_have_embedding=TRUE,
                     random_sample=random_sample,
                     resultsize=resultsize,
                     resultfields='"created_at", "user.screen_name", "user.verified", "user.location", "place.full_name", "place.country", "text", "full_text", "extended_tweet.full_text", "embedding.use_large.primary", "dataset_file", "dataset_entry.annotation.part1.Response", "dataset_entry.annotation.part2-opinion.Response"',
                     elasticsearch_host="lp01.idea.rpi.edu",
                     elasticsearch_path="elasticsearch",
                     elasticsearch_port=443,
                     elasticsearch_schema="https")
# this dataframe contains the tweet text and other metadata
tweet.vectors.df <- results$df[,c("full_text", "user_screen_name", "user_verified", "user_location", "place.country", "place.full_name", "created_at")]
# this matrix contains the embedding vectors for every tweet in tweet.vectors.df
tweet.vectors.matrix <- t(simplify2array(results$df[,"embedding.use_large.primary"]))
###############################################################################
# Clean the tweet and user location text, and set up tweet.vectors.df
# the way we want it by consolidating the location field and computing
# location type
###############################################################################
tweet.vectors.df$user_location <- ifelse(is.na(tweet.vectors.df$place.full_name), tweet.vectors.df$user_location, paste(tweet.vectors.df$place.full_name, tweet.vectors.df$place.country, sep=", "))
tweet.vectors.df$user_location[is.na(tweet.vectors.df$user_location)] <- ""
tweet.vectors.df$user_location_type <- ifelse(is.na(tweet.vectors.df$place.full_name), "User", "Place")
clean_text <- function(text, for_freq=FALSE) {
  text <- str_replace_all(text, "[\\s]+", " ")
  text <- str_replace_all(text, "http\\S+", "")
  if (isTRUE(for_freq)) {
    text <- tolower(text)
    text <- str_replace_all(text, "’", "'")
    text <- str_replace_all(text, "_", "-")
    text <- str_replace_all(text, "[^a-z0-9 ']", "")
  } else {
    # the hyphen is placed last in the class so it is treated literally,
    # not as a character range
    text <- str_replace_all(text, "[^a-zA-Z0-9 `~!@#$%^&*()_=+\\[\\];:'\",./?’-]", "")
  }
  text <- str_replace_all(text, " +", " ")
  trimws(text)
}
tweet.vectors.df$full_text <- sapply(tweet.vectors.df$full_text, clean_text)
tweet.vectors.df$user_location <- sapply(tweet.vectors.df$user_location, clean_text)
In order to determine a good number of clusters, we use the “elbow method” in the absence of a more automated technique, subjectively selecting 17 clusters.
wssplot <- function(data, fc=1, nc=40, seed=20) {
  wss <- data.frame(k=fc:nc, withinss=c(0))
  for (i in fc:nc) {
    set.seed(seed)
    wss[i-fc+1, 2] <- sum(kmeans(data, centers=i, iter.max=30)$withinss)
  }
  ggplot(data=wss, aes(x=k, y=withinss)) +
    geom_line() +
    ggtitle("Quality (within sums of squares) of k-means by choice of k")
}
wssplot(tweet.vectors.matrix)
Next, we add a new column containing the VADER compound sentiment score to the dataframe of tweets to speed up the visualization function's running time. Note that this was written before Abraham's implementation of VADER sentiment directly in Elasticsearch, which immensely speeds up this part of the process.
####################################################
# Compute and attach tweet sentiment to each tweet
####################################################
sentiment.vector <- rep(NA_real_, nrow(tweet.vectors.df))
for (i in 1:nrow(tweet.vectors.df)) {
  tryCatch({
    sentiment.vector[i] <- get_vader(tweet.vectors.df$full_text[i])["compound"]
  }, error = function(e) {
    # leave sentiment.vector[i] as NA; an assignment with <- inside this
    # handler would only modify a local copy of the vector
  })
}
#sentiment.vector <- vader_df(tweet.vectors.df$full_text)[,"compound"]
tweet.vectors.df$sentiment <- sentiment.vector
tweet.vectors.df <- tweet.vectors.df[!is.na(sentiment.vector),]
tweet.vectors.matrix <- tweet.vectors.matrix[!is.na(sentiment.vector),]
We then cluster the tweet embeddings using k-means. For the purposes of demonstration, we do not perform subclustering: in the absence of a technique for automatically identifying topics, the topics contained in the high-level clusters are easier to interpret, from a word-frequency point of view, than the sparser subclusters.
###############################################################################
# Run K-means on all the tweet embedding vectors
###############################################################################
# Number of clusters
k <- 17
set.seed(300)
km <- kmeans(tweet.vectors.matrix, centers=k, iter.max=30)
tweet.vectors.df$vector_type <- factor("tweet", levels=c("tweet", "cluster_center", "subcluster_center"))
tweet.vectors.df$cluster <- as.factor(km$cluster)
#append cluster centers to dataset for visualization
centers.df <- data.frame(full_text=paste("Cluster (", rownames(km$centers), ") Center", sep=""),
                         user_screen_name="[N/A]",
                         user_verified="[N/A]",
                         user_location="[N/A]",
                         user_location_type="[N/A]",
                         place.country="[N/A]",
                         place.full_name="[N/A]",
                         created_at="[N/A]",
                         vector_type="cluster_center",
                         cluster=as.factor(rownames(km$centers)),
                         sentiment=NA)
tweet.vectors.df <- rbind(tweet.vectors.df, centers.df)
tweet.vectors.matrix <- rbind(tweet.vectors.matrix, km$centers)
Next, we find the most common words in each cluster to help interpret them. Ideally, this would be replaced by a more automatic method in the future.
###############################################################################
# Compute labels for each cluster based on word frequency
# and identify the nearest neighbors to each cluster center
###############################################################################
stop_words <- stopwords("en", source="snowball")
stop_words <- union(stop_words, stopwords("en", source="nltk"))
stop_words <- union(stop_words, stopwords("en", source="smart"))
stop_words <- union(stop_words, stopwords("en", source="marimo"))
stop_words <- union(stop_words, c(",", ".", "!", "-", "?", "&", "amp"))
get_word_freqs <- function(full_text) {
  word_freqs <- table(unlist(strsplit(clean_text(full_text, TRUE), " ")))
  word_freqs <- cbind.data.frame(names(word_freqs), as.integer(word_freqs))
  colnames(word_freqs) <- c("word", "count")
  word_freqs <- word_freqs[!(word_freqs$word %in% stop_words),]
  word_freqs <- word_freqs[order(word_freqs$count, decreasing=TRUE),]
}
get_label <- function(word_freqs, exclude_from_labels=NULL, top_k=3) {
  words <- as.character(word_freqs$word)
  exclude_words <- NULL
  if (!is.null(exclude_from_labels)) {
    exclude_words <- unique(unlist(lapply(strsplit(exclude_from_labels, "/"), trimws)))
  }
  label <- paste(setdiff(words, exclude_words)[1:top_k], collapse=" / ")
}
get_nearest_center <- function(df, mtx, center) {
  df$center_cosine_similarity <- apply(mtx, 1, function(v) (v %*% center)/(norm(v, type="2")*norm(center, type="2")))
  nearest_center <- df[order(df$center_cosine_similarity, decreasing=TRUE),]
  nearest_center <- nearest_center[nearest_center$vector_type=="tweet", c("center_cosine_similarity", "full_text", "user_location")]
}
master.word_freqs <- get_word_freqs(tweet.vectors.df$full_text)
master.label <- get_label(master.word_freqs, top_k=6)
clusters <- list()
for (i in 1:k) {
  cluster.df <- tweet.vectors.df[tweet.vectors.df$cluster == i,]
  cluster.matrix <- tweet.vectors.matrix[tweet.vectors.df$cluster == i,]
  cluster.word_freqs <- get_word_freqs(cluster.df$full_text)
  cluster.label <- get_label(cluster.word_freqs, master.label)
  cluster.center <- cluster.matrix[cluster.df$vector_type=="cluster_center",]
  cluster.nearest_center <- get_nearest_center(cluster.df, cluster.matrix, cluster.center)
  clusters[[i]] <- list(word_freqs=cluster.word_freqs, label=cluster.label, nearest_center=cluster.nearest_center)
}
We now visualize the clusters and their sentiments using t-SNE and the plot_tweet_sentiment_timeseries.R function. The first sentiment time series plot corresponds to the entire sample, while the following ones correspond to the clusters in ascending order (having run into this inconvenience, I plan to implement custom plot titles very soon).
###############################################################################
# Run T-SNE on all the tweets and then plot sentiment time series for clusters
###############################################################################
set.seed(700)
tsne <- Rtsne(tweet.vectors.matrix, dims=2, perplexity=25, max_iter=750, check_duplicates=FALSE)
tsne.plot <- cbind(tsne$Y, tweet.vectors.df)
colnames(tsne.plot)[1:2] <- c("X", "Y")
tsne.plot$full_text <- sapply(tsne.plot$full_text, function(t) paste(strwrap(t, width=60), collapse="<br>"))
tsne.plot$cluster.label <- sapply(tsne.plot$cluster, function(c) clusters[[c]]$label)
cluster.sentiment.plots <- list()
#Master high level plot
fig.master <- plot_ly(tsne.plot, x=~X, y=~Y,
                      text=~paste("Cluster:", cluster, "<br>Text:", full_text),
                      color=~cluster.label, type="scatter", mode="markers")
fig.master <- fig.master %>% layout(title=paste("Master Plot:", master.label, "(high level clusters)"),
                                    yaxis=list(zeroline=FALSE),
                                    xaxis=list(zeroline=FALSE))
fig.master <- fig.master %>% toWebGL()
fig.master
#Master-level tweet sentiment by week plot for the entire sample
fig.master.sentiment <- plot_tweet_sentiment_timeseries(tweet.vectors.df, group.by = "week", plot.ma = TRUE)
#Cluster sentiment plots
for (i in 1:k) {
  print(paste("Plotting cluster", i, "sentiment time series..."))
  fig <- plot_tweet_sentiment_timeseries(tweet.vectors.df[tsne.plot$cluster == i,], group.by = "week", plot.ma = TRUE)
  cluster.sentiment.plots[[i]] <- fig
}
## [1] "Plotting cluster 1 sentiment time series..."
## [1] "Plotting cluster 2 sentiment time series..."
## [1] "Plotting cluster 3 sentiment time series..."
## [1] "Plotting cluster 4 sentiment time series..."
## [1] "Plotting cluster 5 sentiment time series..."
## [1] "Plotting cluster 6 sentiment time series..."
## [1] "Plotting cluster 7 sentiment time series..."
## [1] "Plotting cluster 8 sentiment time series..."
## [1] "Plotting cluster 9 sentiment time series..."
## [1] "Plotting cluster 10 sentiment time series..."
## [1] "Plotting cluster 11 sentiment time series..."
## [1] "Plotting cluster 12 sentiment time series..."
## [1] "Plotting cluster 13 sentiment time series..."
## [1] "Plotting cluster 14 sentiment time series..."
## [1] "Plotting cluster 15 sentiment time series..."
## [1] "Plotting cluster 16 sentiment time series..."
## [1] "Plotting cluster 17 sentiment time series..."
word.freq.list <- htmltools::tagList()
for (i in 1:k) {
  # Print cluster word frequencies
  if (isTRUE(show_word_freqs)) {
    word.freq.list[[i]] <- htmltools::HTML(kable(clusters[[i]]$word_freqs[1:10,], caption=paste("Cluster", i, "Top 10 Words")) %>% kable_styling())
  }
}
word.freq.list
Cluster 1 Top 10 Words

| word | count |
|---|---|
| corona | 475 |
| mask | 465 |
| wear | 124 |
| virus | 97 |
| wearing | 83 |
| people | 80 |
| face | 61 |
| masks | 32 |
| social | 28 |
| home | 23 |

Cluster 2 Top 10 Words

| word | count |
|---|---|
| virus | 744 |
| mask | 725 |
| wear | 327 |
| people | 208 |
| wearing | 176 |
| stay | 97 |
| home | 82 |
| covid19 | 79 |
| stop | 79 |
| spread | 78 |

Cluster 3 Top 10 Words

| word | count |
|---|---|
| mask | 573 |
| covid | 246 |
| wear | 243 |
| covid19 | 211 |
| people | 117 |
| wearing | 117 |
| masks | 83 |
| patients | 82 |
| hospital | 71 |
| work | 54 |

Cluster 4 Top 10 Words

| word | count |
|---|---|
| mask | 631 |
| coronavirus | 305 |
| covid19 | 255 |
| face | 170 |
| wearing | 143 |
| wear | 118 |
| cases | 66 |
| mandate | 61 |
| public | 55 |
| masks | 54 |

Cluster 5 Top 10 Words

| word | count |
|---|---|
| mask | 473 |
| covid19 | 356 |
| wear | 306 |
| face | 208 |
| stay | 146 |
| hands | 143 |
| social | 139 |
| home | 103 |
| spread | 103 |
| wash | 102 |

Cluster 6 Top 10 Words

| word | count |
|---|---|
| mask | 798 |
| covid19 | 424 |
| face | 297 |
| coronavirus | 289 |
| masks | 138 |
| facemask | 91 |
| covid | 73 |
| make | 50 |
| wear | 48 |
| virus | 45 |

Cluster 7 Top 10 Words

| word | count |
|---|---|
| covid | 551 |
| mask | 541 |
| wear | 168 |
| wearing | 98 |
| people | 73 |
| 19 | 44 |
| face | 33 |
| masks | 29 |
| put | 24 |
| damn | 21 |

Cluster 8 Top 10 Words

| word | count |
|---|---|
| n95 | 424 |
| masks | 301 |
| covid19 | 175 |
| mask | 163 |
| coronavirus | 100 |
| virus | 75 |
| covid | 52 |
| ppe | 46 |
| pandemic | 42 |
| surgical | 38 |

Cluster 9 Top 10 Words

| word | count |
|---|---|
| covid19 | 315 |
| mask | 180 |
| wearamask | 178 |
| coronavirus | 109 |
| covid | 74 |
| people | 69 |
| cases | 55 |
| 2 | 41 |
| wear | 39 |
| 1 | 36 |

Cluster 10 Top 10 Words

| word | count |
|---|---|
| mask | 1116 |
| virus | 821 |
| masks | 334 |
| wearing | 330 |
| wear | 302 |
| people | 226 |
| spread | 152 |
| protect | 149 |
| stop | 118 |
| face | 107 |

Cluster 11 Top 10 Words

| word | count |
|---|---|
| mask | 1083 |
| wear | 544 |
| covid19 | 434 |
| wearing | 404 |
| people | 236 |
| covid | 199 |
| masks | 197 |
| coronavirus | 160 |
| face | 152 |
| public | 76 |

Cluster 12 Top 10 Words

| word | count |
|---|---|
| mask | 722 |
| pandemic | 686 |
| wear | 287 |
| wearing | 205 |
| people | 189 |
| face | 76 |
| covid19 | 74 |
| masks | 66 |
| public | 50 |
| coronavirus | 49 |

Cluster 13 Top 10 Words

| word | count |
|---|---|
| mask | 510 |
| trump | 426 |
| wear | 189 |
| covid19 | 183 |
| coronavirus | 159 |
| wearing | 122 |
| virus | 102 |
| pandemic | 74 |
| covid | 64 |
| pence | 62 |

Cluster 14 Top 10 Words

| word | count |
|---|---|
| mask | 655 |
| virus | 462 |
| wear | 239 |
| people | 184 |
| wearing | 168 |
| pandemic | 96 |
| covid19 | 95 |
| cdc | 82 |
| masks | 82 |
| trump | 76 |

Cluster 15 Top 10 Words

| word | count |
|---|---|
| mask | 527 |
| quarantine | 161 |
| wearing | 120 |
| covid19 | 115 |
| wear | 106 |
| virus | 78 |
| coronavirus | 67 |
| face | 64 |
| covid | 55 |
| people | 55 |

Cluster 16 Top 10 Words

| word | count |
|---|---|
| mask | 279 |
| lockdown | 257 |
| people | 79 |
| wear | 70 |
| wearing | 66 |
| covid19 | 37 |
| social | 36 |
| face | 30 |
| distancing | 28 |
| masks | 25 |

Cluster 17 Top 10 Words

| word | count |
|---|---|
| mask | 578 |
| covid19 | 292 |
| wear | 197 |
| covid | 171 |
| wearing | 141 |
| people | 124 |
| coronavirus | 70 |
| realdonaldtrump | 57 |
| trump | 55 |
| social | 54 |
Here we see very different trends in count, sentiment, and divisiveness across the different clusters. We highlight:
Clusters 5 and 6: People talking positively about mask usage, quarantining, and social distancing, with many giving advice on mask usage and social distancing. It does seem, however, that cluster 9, which has a decreasing count MA but a high absolute count, is mostly company mask advertisements and news reports, which may suggest corporations exploiting the early weeks of the pandemic to advertise masks and social-distancing-related products. We also note that the tweet count MA for cluster 5, which is mostly people in support of mask usage and social distancing, follows a trend similar to the national death counts, although this would have to be tested more rigorously.
Cluster 10: People talking negatively about mask usage, although this seems to be a division between people opposed to mask policy and people angry at those violating mask policy, which is reflected in the consistently positive divisiveness score. The tweet count for this cluster also shows a consistently increasing trend, suggesting both increasing antimask sentiment and a corresponding increase in opposition to antimaskers.
Cluster 13: People complaining about the Trump administration's management of the crisis and about Trump not following proper mask protocols. This cluster has a clearly increasing trend, reflecting the growing national dissatisfaction with the Trump administration's handling of the situation.
Cluster 4: People discussing COVID-related government actions and mandates. The meaning of negative sentiment here is harder to interpret, as upon direct inspection it seems to come from people's general dread of the situation or anger at government mandates.
Cluster 14: Discussion of the effectiveness of masks and social distancing mandates in general. The positive divisiveness shows how this cluster seems to be split between two populations: those who doubt the effectiveness of masks and those trying to convince people that wearing masks helps combat the virus. Like cluster 10, here we see a clear increasing trend in tweet counts, which suggests more and more people are doubting the effectiveness of masks compared to the start of the pandemic.
Cluster 12 : People commenting on their personal experience with how they and others are following mask requirements. Note how the MA of sentiment shifts from positive to negative as the pandemic goes on, which from direct inspection might be the result of people increasingly complaining about others not wearing masks.
Thus, we observe that although the distribution of sentiment across all tweets over time does not seem to give much useful information, the clusters returned by k-means not only possess very distinctive sentiment trends but may also be interpreted to arrive at claims about the public's sentiment towards mask usage, social distancing, mask mandates, etc.